The Big Picture: How well does our model explain the data?
Three key quantities:
- Total variation in the data
- Variation explained by our model
- Variation left unexplained (residuals)
Our model: \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\)
Fitted values: \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{H}\mathbf{y}\)
where \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\) is the “hat matrix”
Residuals: \(\hat{\boldsymbol{\varepsilon}} = \mathbf{y} - \hat{\mathbf{y}}\)
Definition: Total variation in the response variable \(\mathbf{y}\)
Formula: \[\text{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2\]
Matrix form: \[\text{TSS} = (\mathbf{y} - \bar{y}\mathbf{1})^T(\mathbf{y} - \bar{y}\mathbf{1})\]
where \(\mathbf{1}\) is a vector of ones and \(\bar{y}\) is the sample mean
Think of TSS as: “How spread out are my y-values?”
Key insight: This is what we’re trying to explain with our model
Alternative matrix form: \[\text{TSS} = \mathbf{y}^T\mathbf{y} - n\bar{y}^2\]
Given: \(\mathbf{y} = \begin{bmatrix} 2 \\ 4 \\ 6 \\ 8 \end{bmatrix}\)
Find: TSS using both the definition and matrix form
Step 1: Calculate \(\bar{y} = \frac{2+4+6+8}{4} = 5\)
Step 2: Deviations from mean: \(\begin{bmatrix} -3 \\ -1 \\ 1 \\ 3 \end{bmatrix}\)
Method 1 (definition): \[\text{TSS} = (-3)^2 + (-1)^2 + 1^2 + 3^2 = 9 + 1 + 1 + 9 = 20\]
Method 2 (matrix form): \[\text{TSS} = 2^2 + 4^2 + 6^2 + 8^2 - 4(5^2) = 120 - 100 = 20\]
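The arithmetic above is easy to check in R; a minimal sketch using the vector from the exercise:

```r
# Response vector from the worked example
y <- c(2, 4, 6, 8)
n <- length(y)

# Method 1: definition, sum of squared deviations from the mean
tss_def <- sum((y - mean(y))^2)

# Method 2: matrix form, y'y - n * ybar^2
tss_mat <- sum(y^2) - n * mean(y)^2

c(tss_def, tss_mat)  # both equal 20
```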
Definition: Variation left unexplained by our model
Formula: \[\text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n \hat{\varepsilon}_i^2\]
Matrix form: \[\text{SSE} = \hat{\boldsymbol{\varepsilon}}^T\hat{\boldsymbol{\varepsilon}} = (\mathbf{y} - \hat{\mathbf{y}})^T(\mathbf{y} - \hat{\mathbf{y}})\]
Since \(\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\): \[\text{SSE} = (\mathbf{y} - \mathbf{H}\mathbf{y})^T(\mathbf{y} - \mathbf{H}\mathbf{y})\]
Factor out \(\mathbf{y}\): \[= \mathbf{y}^T(\mathbf{I} - \mathbf{H})^T(\mathbf{I} - \mathbf{H})\mathbf{y}\]
Key property: \(\mathbf{I} - \mathbf{H}\) is symmetric and idempotent
\[\text{SSE} = \mathbf{y}^T(\mathbf{I} - \mathbf{H})\mathbf{y}\]
Think of SSE as: “How much did we miss with our model?”
Smaller SSE = Better fit
Larger SSE = Worse fit
Perfect fit: SSE = 0 (model explains everything)
Definition: Variation explained by our model
Formula: \[\text{SS}_{\text{Reg}} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2\]
Matrix form: \[\text{SS}_{\text{Reg}} = (\hat{\mathbf{y}} - \bar{y}\mathbf{1})^T(\hat{\mathbf{y}} - \bar{y}\mathbf{1})\]
Since \(\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\): \[\text{SS}_{\text{Reg}} = (\mathbf{H}\mathbf{y} - \bar{y}\mathbf{1})^T(\mathbf{H}\mathbf{y} - \bar{y}\mathbf{1})\]
Alternative form: \[\text{SS}_{\text{Reg}} = \mathbf{y}^T\mathbf{H}\mathbf{y} - n\bar{y}^2\]
Think of \(\textrm{SS}_\textrm{Reg}\) as: “How much variation did our model capture?”
Larger \(\textrm{SS}_\textrm{Reg}\) = Model explains more
Smaller \(\textrm{SS}_\textrm{Reg}\) = Model explains less
No model: \(\textrm{SS}_\textrm{Reg}\) = 0 (just predicting the mean)
The key relationship: \[\text{Total Variation} = \text{Unexplained} + \text{Explained}\]
Challenge: Show that TSS = SSE + \(\textrm{SS}_\textrm{Reg}\) using matrix algebra
Hint: Start with the definitions and use properties of the hat matrix
Start with: TSS = \(\mathbf{y}^T\mathbf{y} - n\bar{y}^2\)
Key insight: We need to decompose \(\mathbf{y}\) as \(\mathbf{y} = \hat{\mathbf{y}} + \hat{\boldsymbol{\varepsilon}}\)
Then show: Cross terms vanish due to orthogonality
Write: \(\mathbf{y} = \hat{\mathbf{y}} + \hat{\boldsymbol{\varepsilon}} = \mathbf{H}\mathbf{y} + (\mathbf{I}-\mathbf{H})\mathbf{y}\)
Then: \[\mathbf{y}^T\mathbf{y} = (\mathbf{H}\mathbf{y} + (\mathbf{I}-\mathbf{H})\mathbf{y})^T(\mathbf{H}\mathbf{y} + (\mathbf{I}-\mathbf{H})\mathbf{y})\]
Four terms: \[= \mathbf{y}^T\mathbf{H}^T\mathbf{H}\mathbf{y} + \mathbf{y}^T\mathbf{H}^T(\mathbf{I}-\mathbf{H})\mathbf{y}\] \[+ \mathbf{y}^T(\mathbf{I}-\mathbf{H})^T\mathbf{H}\mathbf{y} + \mathbf{y}^T(\mathbf{I}-\mathbf{H})^T(\mathbf{I}-\mathbf{H})\mathbf{y}\]
Key property: \(\mathbf{H}(\mathbf{I}-\mathbf{H}) = \mathbf{0}\) (orthogonal projections)
Therefore: The cross terms equal zero
Result: \[\mathbf{y}^T\mathbf{y} = \mathbf{y}^T\mathbf{H}\mathbf{y} + \mathbf{y}^T(\mathbf{I}-\mathbf{H})\mathbf{y}\]
Subtract \(n\bar{y}^2\) from both sides: \[\text{TSS} = \text{SS}_{\text{Reg}} + \text{SSE}\]
Beautiful result: Total variation splits perfectly into explained and unexplained parts
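The decomposition can be verified numerically from the matrix forms. A sketch with simulated data (the design matrix and coefficients below are made up for illustration; the model must include an intercept for the identity to hold):

```r
# Verify TSS = SS_Reg + SSE using the hat matrix
set.seed(1)
n <- 30
X <- cbind(1, rnorm(n), rnorm(n))       # design matrix with intercept column
y <- X %*% c(1, 2, -1) + rnorm(n)       # simulated response

H   <- X %*% solve(t(X) %*% X) %*% t(X) # hat matrix
TSS <- sum((y - mean(y))^2)
SSE <- drop(t(y) %*% (diag(n) - H) %*% y)
SSR <- drop(t(y) %*% H %*% y) - n * mean(y)^2

all.equal(TSS, SSR + SSE)  # TRUE (up to floating-point error)
```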
Definition: Proportion of total variation explained by the model
Formula: \[R^2 = \frac{\text{SS}_{\text{Reg}}}{\text{TSS}} = 1 - \frac{\text{SSE}}{\text{TSS}}\]
Range: \(0 \leq R^2 \leq 1\)
R² = 0.8 means “80% of variation is explained by the model”
Perfect fit: R² = 1 (model explains everything)
No relationship: R² = 0 (model explains nothing)
Warning: High R² doesn’t always mean good model!
Given: TSS = 100, SSE = 25
Find: R² and interpret the result
Method 1: \[R^2 = 1 - \frac{\text{SSE}}{\text{TSS}} = 1 - \frac{25}{100} = 0.75\]
Method 2: \[\text{SS}_{\text{Reg}} = \text{TSS} - \text{SSE} = 100 - 25 = 75\] \[R^2 = \frac{75}{100} = 0.75\]
Interpretation: The model explains 75% of the variation in y
# Load the famous Anscombe's Quartet
data(anscombe)
# All have same R²!
lm1 <- lm(y1 ~ x1, data = anscombe)
lm2 <- lm(y2 ~ x2, data = anscombe)
lm3 <- lm(y3 ~ x3, data = anscombe)
lm4 <- lm(y4 ~ x4, data = anscombe)
c(summary(lm1)$r.squared, summary(lm2)$r.squared,
  summary(lm3)$r.squared, summary(lm4)$r.squared)
[1] 0.6665425 0.6662420 0.6663240 0.6667073
Single predictor: Is \(\beta_1\) significantly different from 0?
Multiple predictors: Are any of the predictors useful?
Subset test: Is a group of predictors jointly significant?
Null hypothesis: \(H_0: \beta_j = 0\)
Alternative: \(H_1: \beta_j \neq 0\)
Test statistic: \[t = \frac{\hat{\beta}_j}{\text{se}(\hat{\beta}_j)}\]
Standard error: \(\text{se}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2[(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}}\)
Recall: \(\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\)
For individual coefficient j: \[\text{Var}(\hat{\beta}_j) = \sigma^2[(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}\]
Estimate \(\sigma^2\): \(\hat{\sigma}^2 = \frac{\text{SSE}}{n-p}\) where p = number of parameters
Under normality assumption: \[\frac{\hat{\beta}_j - \beta_j}{\text{se}(\hat{\beta}_j)} \sim t_{n-p}\]
For testing \(H_0: \beta_j = 0\): \[t = \frac{\hat{\beta}_j}{\text{se}(\hat{\beta}_j)} \sim t_{n-p}\]
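These formulas reproduce exactly what `summary(lm(...))` reports. A sketch with simulated data (the data-generating model here is made up for illustration):

```r
# Reproduce lm()'s t-statistics from the matrix formulas
set.seed(2)
n <- 25
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)
fit <- lm(y ~ x)

X <- model.matrix(fit)
p <- ncol(X)                                   # number of parameters
sigma2_hat <- sum(resid(fit)^2) / (n - p)      # SSE / (n - p)
se <- sqrt(sigma2_hat * diag(solve(t(X) %*% X)))
t_manual <- coef(fit) / se

# Compare with the t values reported by summary()
cbind(t_manual, summary(fit)$coefficients[, "t value"])
```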
Null hypothesis: \(H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0\)
Alternative: At least one \(\beta_j \neq 0\) (j ≠ 0)
This tests: “Is the model useful at all?”
Test statistic: \[F = \frac{\text{SS}_{\text{Reg}}/(p-1)}{\text{SSE}/(n-p)} = \frac{\text{Mean Square Regression}}{\text{Mean Square Error}}\]
Under \(H_0\): \(F \sim F_{p-1, n-p}\)
Numerator: How much variation does model explain per parameter?
Denominator: How much unexplained variation per residual degree of freedom?
Large F: Model explains a lot relative to noise
Small F: Model doesn’t explain much more than noise
Alternative F-statistic form: \[F = \frac{R^2/(p-1)}{(1-R^2)/(n-p)}\]
This shows: F-test is really testing whether R² is significantly different from 0
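The equivalence of the two F-statistic forms can be checked against `lm()` directly; a sketch with simulated data (the model below is made up for illustration):

```r
# Check that F = [R^2/(p-1)] / [(1-R^2)/(n-p)] matches lm()'s overall F
set.seed(3)
n <- 40
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + x1 + rnorm(n)
fit <- lm(y ~ x1 + x2)

s  <- summary(fit)
p  <- length(coef(fit))
r2 <- s$r.squared
F_from_r2 <- (r2 / (p - 1)) / ((1 - r2) / (n - p))

c(F_from_r2, s$fstatistic["value"])  # the two agree
```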
General form: \(H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{d}\)
Where:
- \(\mathbf{C}\) is a contrast matrix (q × p)
- \(\mathbf{d}\) is a vector of constants
- q is the number of restrictions
Single coefficient: \(\mathbf{C} = [0, 1, 0, 0]\), \(\mathbf{d} = 0\)
Tests: \(\beta_1 = 0\)
Overall test: \(\mathbf{C} = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}\), \(\mathbf{d} = \mathbf{0}\)
Tests: All slope coefficients = 0
Equality test: \(\mathbf{C} = [0, 1, -1, 0]\), \(\mathbf{d} = 0\)
Tests: \(\beta_1 = \beta_2\)
Test statistic: \[F = \frac{(\mathbf{C}\hat{\boldsymbol{\beta}} - \mathbf{d})^T[\mathbf{C}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{C}^T]^{-1}(\mathbf{C}\hat{\boldsymbol{\beta}} - \mathbf{d})/q}{\text{SSE}/(n-p)}\]
Under \(H_0\): \(F \sim F_{q, n-p}\)
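The general linear hypothesis F-statistic can be coded directly from this formula. A sketch for a joint test of two coefficients (data simulated for illustration; the result should match the extra-sum-of-squares F from comparing nested models):

```r
# General linear hypothesis test  H0: C beta = d
set.seed(4)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)          # beta2 = beta3 = 0 in truth
fit <- lm(y ~ x1 + x2 + x3)

X <- model.matrix(fit)
b <- coef(fit)
p <- ncol(X)
C <- rbind(c(0, 0, 1, 0),           # row testing beta2 = 0
           c(0, 0, 0, 1))           # row testing beta3 = 0
d <- c(0, 0)
q <- nrow(C)                        # number of restrictions

SSE <- sum(resid(fit)^2)
num <- t(C %*% b - d) %*%
  solve(C %*% solve(t(X) %*% X) %*% t(C)) %*% (C %*% b - d) / q
F_stat <- drop(num) / (SSE / (n - p))
p_val  <- pf(F_stat, q, n - p, lower.tail = FALSE)
```

The same test is available via nested-model comparison, e.g. `anova(lm(y ~ x1), fit)`.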
Scenario: Three predictors, want to test if the last two coefficients are both zero
Model: \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon\)
Question: What are \(\mathbf{C}\) and \(\mathbf{d}\) for \(H_0: \beta_2 = \beta_3 = 0\)?
Answer: \[\mathbf{C} = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \quad \mathbf{d} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}\]
Check: \(\mathbf{C}\boldsymbol{\beta} = \begin{bmatrix} \beta_2 \\ \beta_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}\) ✓
| Source | df | Sum of Squares | Mean Square | F |
|---|---|---|---|---|
| Regression | p-1 | \(\textrm{SS}_\textrm{Reg}\) | \(\textrm{MS}_\textrm{Reg}\) | \(\textrm{MS}_\textrm{Reg}\)/MSE |
| Error | n-p | SSE | MSE | |
| Total | n-1 | TSS | | |
Key relationships:
- TSS = \(\mathbf{y}^T\mathbf{y} - n\bar{y}^2\)
- \(\textrm{SS}_\textrm{Reg}\) = \(\mathbf{y}^T\mathbf{H}\mathbf{y} - n\bar{y}^2\)
- SSE = \(\mathbf{y}^T(\mathbf{I} - \mathbf{H})\mathbf{y}\)
- TSS = \(\textrm{SS}_\textrm{Reg}\) + SSE
R² = \(\textrm{SS}_\textrm{Reg}\)/TSS = 1 - SSE/TSS
Step 1: Overall F-test (is model useful?)
Step 2: Subset F-tests (are groups of predictors significant?)
Step 3: Individual t-tests (which predictors matter?)
Definition: The probability of observing a test statistic as extreme or more extreme than what we observed, assuming the null hypothesis is true
In other words: “How surprising is our result if \(H_0\) were true?”
For testing \(H_0: \beta_j = 0\):
Test statistic: \(t = \frac{\hat{\beta}_j}{\text{se}(\hat{\beta}_j)}\)
Two-sided p-value:
\[\text{p-value} = P(|T| \geq |t|) = 2 \times P(T \geq |t|)\] where \(T \sim t_{n-p}\)
For overall test \(H_0:\) all slopes = 0:
Test statistic: \(F = \frac{\text{SS}_{\text{Reg}}/(p-1)}{\text{SSE}/(n-p)}\)
One-sided p-value:
\[\text{p-value} = P(F_{p-1,n-p} \geq f)\] where \(f\) is our observed F-statistic
Scenario: Testing \(H_0: \beta_1 = 0\) with t = 2.8 and df = 18
Given: p-value = 0.012
Questions: How do we interpret this p-value, and what do we conclude at α = 0.05 and at α = 0.01?
Interpretation: If \(\beta_1 = 0\) were true, there’s only a 1.2% chance of seeing a t-statistic as extreme as ±2.8 or more extreme
At α = 0.05: Reject \(H_0\) (p = 0.012 < 0.05) - Evidence suggests \(\beta_1 \neq 0\)
At α = 0.01: Fail to reject \(H_0\) (p = 0.012 > 0.01) - Not enough evidence at this stricter level
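The quoted p-value can be reproduced with pt(); a one-line check of the example above:

```r
# Two-sided p-value for t = 2.8 with df = 18
p_val <- 2 * pt(2.8, df = 18, lower.tail = FALSE)
round(p_val, 3)  # 0.012
```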
For t-tests: Use pt() function
For F-tests: Use pf() function
Both give: Cumulative distribution function (CDF) values
Two-sided test (\(H_0: \beta_j = 0\)): use pt()
Why the formula? pt(2.8, 18) gives P(T ≤ 2.8), so 1 - pt(2.8, 18) gives P(T > 2.8); doubling it covers both tails
One-sided test (\(H_0:\) all slopes = 0): use pf()
Why one-sided? F-statistics are always ≥ 0, so we only care about the right tail
Given:
- t-statistic = -1.96, df = 24
- F-statistic = 8.5, df1 = 2, df2 = 20
Calculate both p-values using R
# t-test p-value (two-sided)
t_val <- -1.96
df_t <- 24
p_t <- 2 * pt(abs(t_val), df_t, lower.tail = FALSE)
cat("t-test p-value:", round(p_t, 4))
t-test p-value: 0.0617
# F-test p-value (one-sided)
f_val <- 8.5
df1_f <- 2
df2_f <- 20
p_f <- pf(f_val, df1_f, df2_f, lower.tail = FALSE)
cat("\nF-test p-value:", round(p_f, 4))
F-test p-value: 0.0021
Interpretation:
- t-test: p = 0.0617 (not significant at α = 0.05)
- F-test: p = 0.0021 (significant at α = 0.05)
What p-values DON’T tell us:
- The probability that \(H_0\) is true
- The size or practical importance of the effect
- The probability that the result will replicate
Remember: Statistical significance ≠ practical significance
Best practice: Report both p-values AND effect sizes (\(\hat{\beta}_j\))
P-values help us:
- Quantify the evidence against \(H_0\)
- Make consistent reject/fail-to-reject decisions at a chosen α
But remember:
- P-values depend on sample size; with enough data, tiny effects become “significant”
- Always report estimates and their uncertainty alongside p-values
Given: n = 20, p = 4, TSS = 1000, SSE = 300
Calculate:
- R²
- Overall F-statistic
- State a conclusion at the α = 0.05 level
R² calculation: \[R^2 = 1 - \frac{SSE}{TSS} = 1 - \frac{300}{1000} = 0.7\]
F-statistic: \[F = \frac{SS_{Reg}/(p-1)}{SSE/(n-p)} = \frac{700/3}{300/16} = \frac{233.33}{18.75} = 12.44\]
Critical value: \(F_{0.05, 3, 16} = 3.24\)
Conclusion: Reject \(H_0\). The model is statistically significant.
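The final example can be verified in R, including the critical value and the exact p-value:

```r
# Final example: n = 20, p = 4, TSS = 1000, SSE = 300
n <- 20; p <- 4
TSS <- 1000; SSE <- 300
SSR <- TSS - SSE                               # SS_Reg = 700

F_stat <- (SSR / (p - 1)) / (SSE / (n - p))    # = 12.44...
F_crit <- qf(0.95, p - 1, n - p)               # about 3.24
p_val  <- pf(F_stat, p - 1, n - p, lower.tail = FALSE)

c(F = F_stat, crit = F_crit, p = p_val)
```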